[Test] Add some gsm8k configs for hybrid models. #35406

Open
tdoublep wants to merge 1 commit into vllm-project:main from tdoublep:tpa-hybrid-eval

@tdoublep tdoublep commented Feb 26, 2026

Purpose

This PR adds some configs to the gsm8k testing framework that are very helpful for development on the hybrid models. I found this super helpful for debugging something I'm working on right now related to MTP + prefix caching + async scheduling.

Test Plan

They can be run with:

pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py --config-list-file=tests/evals/gsm8k/configs/hybrid/models-h100.txt -k 'Qwen3-Next-FP8-TP4-MTP-Align'

Test Result

GSM8K Results for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8:
  Measured metric: 0.8264
  Expected metric: 0.8500
  Tolerance: 0.0800
  Questions: 1319
  Invalid rate: 0.000
  Latency: 78.5s
  QPS: 16.8
✅ GSM8K test passed for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
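
As a quick sanity check on the numbers above, here is a minimal sketch of how such a result could be judged. The pass rule (the measured accuracy may not fall more than the tolerance below the expected value) is an assumption and is not copied from test_gsm8k_correctness.py; the helper names `gsm8k_passes` and `queries_per_second` are hypothetical.

```python
# Sketch only: the pass rule is an assumed lower-bound check,
# not the actual logic of the vLLM test file.
def gsm8k_passes(measured: float, expected: float, tolerance: float) -> bool:
    """Return True when `measured` is at most `tolerance` below `expected`."""
    return measured >= expected - tolerance


def queries_per_second(num_questions: int, latency_s: float) -> float:
    """Throughput implied by the question count and the wall-clock latency."""
    return num_questions / latency_s


if __name__ == "__main__":
    # Values copied from the result block above.
    print(gsm8k_passes(0.8264, 0.8500, 0.0800))
    print(round(queries_per_second(1319, 78.5), 1))
```

Under that assumed rule the run passes with roughly 0.056 of headroom, and 1319 questions in 78.5 s is consistent with the reported QPS.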

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep tdoublep requested a review from mgoin as a code owner February 26, 2026 15:20

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds several test configurations for gsm8k evaluation of hybrid models, specifically for Qwen3Next. My review found a critical issue in one of the new configuration files. The configuration for Qwen3-Next-FP8-TP4-MTP-Align enables prefix caching, which is not supported for hybrid models like Qwen3Next and will cause the engine to fail. This should be removed.

--max-model-len 4096
--tensor-parallel-size 4
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
--enable-prefix-caching

critical

The configuration enables prefix caching (--enable-prefix-caching) for the Qwen3Next model. Hybrid models like Qwen3Next do not support prefix caching, which will cause a ValueError during engine initialization. Please remove this argument.

@@ -0,0 +1,9 @@
model_name: "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"
Collaborator

can we just have this one? or is it also useful to test the non spec decoding / non-prefix caching case

Member Author

It's useful yeah. I put the 3 configs in a "hybrid" folder so as not to pollute what's there.

Alternatively, if we want to keep the number of configs to a minimum, maybe it could be useful to be able to pass additional overrides when passing the configs to pytest (if that isn't possible already).
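
To make the override idea concrete, here is a rough sketch of how extra arguments might be merged into a config's server arguments before the server is launched. Nothing here exists in the test framework as far as this thread shows: `group_flags` and `merge_server_args` are hypothetical helpers, and the assumption is that a config's arguments are available as a single command-line string.

```python
import shlex


def group_flags(tokens: list[str]) -> dict[str, list[str]]:
    """Group tokens into {flag: [values...]}, keeping first-seen flag order."""
    grouped: dict[str, list[str]] = {}
    current = None
    for tok in tokens:
        if tok.startswith("--"):
            current = tok
            grouped[current] = []
        elif current is not None:
            grouped[current].append(tok)
    return grouped


def merge_server_args(base: str, overrides: str) -> list[str]:
    """Merge two CLI strings; a flag in `overrides` replaces the same flag in `base`."""
    merged = group_flags(shlex.split(base))
    # dict.update keeps the position of existing flags and appends new ones.
    merged.update(group_flags(shlex.split(overrides)))
    out: list[str] = []
    for flag, values in merged.items():
        out.append(flag)
        out.extend(values)
    return out


if __name__ == "__main__":
    base = (
        "--max-model-len 4096 --tensor-parallel-size 4 "
        "--speculative-config "
        "'{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}' "
        "--enable-prefix-caching"
    )
    # Hypothetical override string, e.g. passed through a new pytest option.
    print(merge_server_args(base, "--max-model-len 8192 --tensor-parallel-size 2"))
```

This sketch only handles the space-separated `--flag value` form; `--flag=value` would be treated as a distinct flag, and it cannot remove a boolean flag that the base config sets.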

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions Bot added the stale label (over 90 days of inactivity) May 28, 2026